The White Wine Data Set
This data set contains 4898 observations of 12 variables (13 if you count X, which just appears to be another index). All variables are quantitative measures of properties of white wines. The variable ‘quality’ is of the type integer, since quality ratings only appear as whole numbers. Every other column is numeric.
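A minimal sketch of how the data might be loaded and inspected in base R. The file name `wineQualityWhites.csv` and the helper `load_wine` are assumptions made for illustration, not names confirmed by this report.

```r
# Hypothetical file name; substitute the path to your own copy of the data.
wine_path <- "wineQualityWhites.csv"

# Hypothetical helper: read the CSV and drop the redundant index column X.
load_wine <- function(path) {
  wine <- read.csv(path)
  wine$X <- NULL  # no effect if the column is absent
  wine
}

# Only run the inspection when the file is actually available.
if (file.exists(wine_path)) {
  wine <- load_wine(wine_path)
  str(wine)  # lists each column with its type (int for quality, num otherwise)
}
```

Dropping `X` up front keeps the index column out of later correlation matrices and model formulas.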
Quality values show a fairly normal distribution, with values ranging between 3 and 9, and most values falling between 5 and 7. The mode score is 6.
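One way to check the mode of the quality scores is a simple frequency table. The sketch below uses a small made-up vector rather than the wine data, and `quality_mode` is a hypothetical helper name.

```r
# Hypothetical helper: return the most frequent score in an integer vector.
quality_mode <- function(quality) {
  counts <- table(quality)
  as.integer(names(counts)[which.max(counts)])
}

# Made-up scores for illustration only; these are NOT the wine ratings.
toy_quality <- c(5L, 6L, 6L, 7L, 6L, 5L, 8L)

table(toy_quality)         # counts per score
quality_mode(toy_quality)  # most frequent score in the toy vector
```

With the real data, `table(wine$quality)` would also show how few wines sit at the extremes of the scale.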
The pH values in the data set appear to be normally distributed, with most values falling between 2.9 and 3.5.
The maximum density value is 1.0390, well beyond the bulk of the data, which lies between 0.99 and 1.00. The data is otherwise normally distributed.
Alcohol content appears to be slightly right skewed, with more wines having content between 9 and 10.5 versus 11 and 13.5.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   3.800   6.300   6.800   6.855   7.300  14.200
Fixed acidity also has a roughly normal distribution, but its maximum value (14.2) lies very far above the mean (6.855).
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.600   1.700   5.200   6.391   9.900  65.800
Residual sugar is right skewed, with most values being on the lower end (75% fall below 9.9), but the maximum being all the way at 65.8.
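The report does not say how outliers were judged. One common convention, assumed here rather than taken from the report, is Tukey's rule: a value is flagged when it lies above the third quartile plus 1.5 times the interquartile range. The sketch below applies that rule to the quartiles printed in the two summaries above; `upper_fence` is a hypothetical helper name.

```r
# Hypothetical helper: Tukey's upper fence, Q3 + 1.5 * IQR.
upper_fence <- function(q1, q3) {
  q3 + 1.5 * (q3 - q1)
}

# Quartiles and maxima copied from the summary output above.
sugar_fence   <- upper_fence(q1 = 1.7, q3 = 9.9)  # 22.2
acidity_fence <- upper_fence(q1 = 6.3, q3 = 7.3)  # 8.8

65.8 > sugar_fence    # TRUE: the residual sugar maximum is far past the fence
14.2 > acidity_fence  # TRUE: so is the fixed acidity maximum

# A mean above the median is a quick hint of right skew.
6.391 > 5.2           # TRUE for residual sugar
```

Under this rule both maxima count as outliers, which matches the visual impression described in the text.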
All of the variables, aside from the quality rating, appear to be chemical properties of wine. Some of the variables are normally distributed, while others are left or right skewed. A few of the variables have outliers, such as fixed acidity, density, and residual sugar.
Quality and alcohol content are two of the most accessible and easy to understand variables, so it will be interesting to look at what chemical properties are associated with changes in these variables.
Residual sugar and fixed acidity affect the taste of wine - high residual sugar values make for a sweet wine, and high fixed acidity values indicate a sourness or tartness. It will be interesting to compare the two and see what other properties go along with these strong taste indicators.
I assumed there might be a negative relationship between residual sugar and fixed acidity since they have different effects on the taste of wine, but this does not appear to be the case. There is not a strong relationship between the two variables. The ggpairs grid shows a correlation coefficient of .089.
Here we see a moderately strong negative relationship. The correlation coefficient between alcohol and residual sugar is -.451, meaning that there is a somewhat downward trend in residual sugar as alcohol content increases. This makes sense, as from what I understand, the sugar in grapes is what is fermented into alcohol.
There’s a weak negative relationship here (correlation of -.121), which is quite a bit weaker than the relationship between residual sugar and alcohol. Let’s see if we can find a variable that has a stronger relationship to fixed acidity.
According to the ggpairs grid of correlation coefficients, fixed acidity is most strongly related to pH, with a correlation coefficient of -.426. It makes sense that wines with lower pH tend to have higher fixed acidity, since a lower pH indicates a more acidic solution.
There is a strong negative relationship between density and alcohol: -.78. This means that wines with higher alcohol contents tend to have lower densities.
Alcohol and quality have a correlation of .436, so higher alcohol wines tend to have higher quality ratings, though there is a lot of variability for almost every score.
While alcohol and residual sugar are negatively correlated, and alcohol and quality are positively correlated, quality and residual sugar do not appear to have a strong relationship. The correlation coefficient is -.0976, indicating a weak negative relationship. From the plot, it is hard to see a trend or relationship between the variables.
## [1] 0.8389665
Density and residual sugar have a very strong positive relationship. This is the strongest correlation of any two variables in the data set. As residual sugar increases, density increases in what appears to be a linear manner.
Some variables have very strong correlations, like density and residual sugar, and density and alcohol, or moderately strong correlations, like alcohol and residual sugar. I expected residual sugar and fixed acidity to have perhaps a negative correlation, since they cause different tastes in wine (sweet versus sour/acidic), but they merely had a weak positive relationship.
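Rather than reading the strongest relationships off a plot grid, the pairs can be ranked directly from a correlation matrix. The sketch below is one way to do that in base R; `strongest_pairs` is a hypothetical helper, and the data frame is simulated for illustration, not drawn from the wine data.

```r
# Hypothetical helper: rank variable pairs by absolute Pearson correlation.
strongest_pairs <- function(df, n = 3) {
  cm <- cor(df)
  # Keep each pair once by blanking the diagonal and the lower triangle.
  cm[lower.tri(cm, diag = TRUE)] <- NA
  pairs <- as.data.frame(as.table(cm), stringsAsFactors = FALSE)
  names(pairs) <- c("var1", "var2", "r")
  pairs <- pairs[!is.na(pairs$r), ]
  pairs <- pairs[order(-abs(pairs$r)), ]
  head(pairs, n)
}

# Simulated stand-in data (NOT the wine measurements).
set.seed(1)
toy <- data.frame(a = 1:20)
toy$b <- 2 * toy$a + rnorm(20)  # built to track a closely
toy$c <- rnorm(20)              # unrelated noise

strongest_pairs(toy, n = 1)     # the a/b pair ranks first
```

On the real data, the call would be `strongest_pairs(wine)` after making sure every column is numeric.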
##
## Calls:
## lm1: lm(formula = density ~ residual.sugar, data = wine)
## lm2: lm(formula = density ~ residual.sugar + alcohol, data = wine)
##
## ================================================
##                        lm1            lm2
## ------------------------------------------------
##   (Intercept)          0.991***       1.005***
##                       (0.000)        (0.000)
##   residual.sugar       0.000***       0.000***
##                       (0.000)        (0.000)
##   alcohol                            -0.001***
##                                      (0.000)
## ------------------------------------------------
##   R-squared            0.704          0.907
##   adj. R-squared       0.704          0.907
##   sigma                0.002          0.001
##   F                11636.984      23791.076
##   p                    0.000          0.000
##   Log-likelihood   24498.873      27328.019
##   Deviance             0.013          0.004
##   AIC             -48991.747     -54648.037
##   BIC             -48972.257     -54622.051
##   N                 4898           4898
## ================================================
Density/residual sugar correlation:
cor(wine$density, wine$residual.sugar)
Fixed acidity/quality correlation:
wine$quality <- as.numeric(wine$quality)
cor(wine$fixed.acidity, wine$quality)
Once again, density and residual sugar have a strong positive, linear relationship. You can see much more of a relationship between alcohol and density (the color changes from left to right more than it changes from top to bottom) than alcohol and residual sugar. Running the linear model, we see that the R-squared value for density and residual sugar is .704, and increases to .907 by adding alcohol to the model. These three variables are strongly related and can be predictors for one another.
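The two formulas shown in the model table can be refit and compared with a few lines of base R. The sketch below wraps them in a hypothetical helper, `compare_density_models`, and runs it on simulated data whose coefficients are arbitrary choices for illustration, not estimates from the wine data.

```r
# Hypothetical helper: fit the two density models and return their R-squared.
compare_density_models <- function(wine) {
  lm1 <- lm(density ~ residual.sugar, data = wine)
  lm2 <- lm(density ~ residual.sugar + alcohol, data = wine)
  c(lm1 = summary(lm1)$r.squared, lm2 = summary(lm2)$r.squared)
}

# Simulated stand-in data (NOT the wine measurements); coefficients are arbitrary.
set.seed(42)
n <- 200
sim <- data.frame(
  residual.sugar = runif(n, 1, 20),
  alcohol        = runif(n, 8, 14)
)
sim$density <- 1 + 0.0004 * sim$residual.sugar -
  0.001 * sim$alcohol + rnorm(n, sd = 0.0005)

compare_density_models(sim)  # R-squared rises once alcohol is added
```

In the table above, the residual sugar coefficient prints as 0.000 only because the display rounds to three decimals; calling `coef()` on the fitted model shows the unrounded value.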
Here we see that there is a large variability in residual sugar values, and that higher alcohol is tied to higher quality ratings as well as lower residual sugar (the green dots being clustered more in the lower right-hand corner).
I explored some of the variables with the strongest correlations, including alcohol, residual sugar, quality, fixed acidity, and density. I experimented with different types of plots to see what was the most illuminating way to represent the data.
What properties are strongly associated with higher quality wines? Here we can see that wines with ratings of 3 and 4 have slightly higher median alcohol contents than those with ratings of 5, but the medians climb steadily upward from there as the quality ratings increase. It’s also clear to see from this plot that most wines fall in the 5-7 rating range.
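The medians that a boxplot draws can also be computed directly, which makes the comparison across ratings easier to check. The sketch below uses a small made-up data frame, not the wine data, and `median_by_quality` is a hypothetical helper name.

```r
# Hypothetical helper: median alcohol content for each quality score.
median_by_quality <- function(wine) {
  tapply(wine$alcohol, wine$quality, median)
}

# Made-up values for illustration only.
toy_wine <- data.frame(
  quality = c(5, 5, 6, 6, 7, 7),
  alcohol = c(9.2, 9.8, 10.1, 10.9, 11.4, 12.0)
)

median_by_quality(toy_wine)  # one median per quality score
```

Base R's `boxplot(alcohol ~ quality, data = wine)` would draw the matching plot from the same two columns.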
Here we see that these three variables are related. Higher alcohol content (the black dots) is more concentrated on the lower right side of the plot, where residual sugar is lower and quality ratings are higher. Lower alcohol content is associated with lower quality ratings and higher residual sugar.
Finally, these three plots look at residual sugar, density and alcohol in three different groups: low quality wines (those receiving a rating of 3, 4, or 5), medium quality (6 or 7) and high quality (8-9).
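The three quality groups described above can be built with `cut()`. This is a sketch of one way to do it; the helper name `quality_group` and the labels are assumptions, while the group boundaries follow the text.

```r
# Hypothetical helper: bin scores into low (3-5), medium (6-7), high (8-9).
quality_group <- function(quality) {
  # Intervals are right-closed by default: (2,5], (5,7], (7,9].
  cut(quality, breaks = c(2, 5, 7, 9), labels = c("low", "medium", "high"))
}

quality_group(3:9)
# 3, 4, 5 -> low; 6, 7 -> medium; 8, 9 -> high
```

The result is a factor, so it can be used directly as a faceting or grouping variable when plotting.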
We can see the steepest trend line in the low quality wines, meaning that they have more sugar and less density compared to the medium quality wines. The high quality wines have a few data points in the high residual sugar/low alcohol/high density region, but they are mostly concentrated in the bottom left corner, with low residual sugar/high alcohol/low density.
I was interested to learn that higher quality white wines tend to have higher alcohol contents, lower sugar contents, and be less dense than lower quality wines. However, lack of sweetness did not necessarily mean the higher quality wines were more acidic.
I was surprised to find that density and residual sugar were the most strongly related variables. Residual sugar and alcohol also have a negative relationship, so perhaps alcohol is less dense than other elements in wine.